Pair Trading with Machine Learning

Pair Trading - Introduction

Pair trading is a commonly used quantitative trading strategy used in hedge funds and investment banks. A major aspect of this strategy is the spread analysis/prediction amongst a pair of selected stocks.

The prinicple is as follows.

Let's say you have a pair of securities X and Y that have some underlying economic link. An example might be two companies that manufacture the same product, or two companies in one supply chain. If we can model this economic link with a mathematical model, we can make trades on it.

Process Flow

In [1]:
#Importing  Dependencies
import numpy as np
import pandas as pd
import os
import glob
import pickle as pickle
import json
import missingno
import itertools
from itertools import cycle
from datetime import datetime
from builtins import object
from datetime import date
from dateutil.relativedelta import relativedelta
from collections import defaultdict
from concurrent import futures
import pandas_datareader.data as web
import warnings
warnings.filterwarnings('ignore')

#visualization dependencies
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from plotly import graph_objects as go
from plotly import offline as py
import plotly.express as px
%matplotlib inline

#machine learning/statistical dependencies
import statsmodels.api as sm
from kneed import KneeLocator
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, silhouette_score
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation
from sklearn.manifold import TSNE
import statsmodels.api as sm
import statsmodels.tsa.stattools as ts
from statsmodels.tsa.arima_model import ARIMA
from pykalman import KalmanFilter
from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, Dropout
import scipy.cluster.hierarchy as shc

Data Gathering

We scrape the web to gather the financial data of S&P 500 companies for last 5 years from current date to be utilized for the proposed pair trading strategy implementation.

Data Sources

  • The ticker, sector and industry metadata for S&P 500 companies was scraped from Wikipidea. (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)

  • The financial information for each company was gathered using python's pandas_datareader library using the company tickers as identifiers.

We developed the class gather_data, which encapculates all the function utilized for the data gathering process.

Quick overview of the functions:

  • download_s_and_p_stats(): Used to fetch S&P 500 company metadata (ticker, sector, industry) from wikipidea.

  • download_stock_financials(): Used to fetch financial data (open, low, high, close, volume, adj close) given a stock ticker.

  • MultiThreader(): Used to speed up the process of fetching stock data using multi threading.

  • combine_stock_data(): Used to merge all the stock CSV data files into on combined file.

The data gathering process consists of the following steps:

  1. The MultiThreader() is called through a class object. It utilizes the download_s_and_p_stats() to get the stock metadata. This metadata is then passed to the download_stock_financials() with 50 maximum parallel workers.

  2. The download_stock_financials() extracts the ticker from each stock's metadata. This ticker is used with the start and end dates to fetch financial data of the stock for that time period.

  3. The financial data is merged with the metadata for each stock and a CSV file for each stock is created and stored at the given location.

  4. We utilize the combine_stock_data() to merge all the gathered stock CSV files into one dataframe and optionally store it in the given location

In [ ]:
class gather_data(object):
    def __init__(self):
        super(gather_data, self).__init__()
        
        #The start date for historical data
        self.start_time = str(date.today()- relativedelta(years=5))
        
        #The end date for historical data
        self.end_time = str(date.today())
    
    def download_s_and_p_stats(self):
        '''
        Fetches comapny ticker, sector and industry data for
        S&P 500 companies
        '''
        data = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
        df = data[0].reset_index()
        #Extracting required features
        df = df[['Symbol','GICS Sector','GICS Sub-Industry']]
        #Formatting the group symbols
        df['Symbol'] = [sym.replace('.','-') if sym.find('.') != -1 else sym for sym in list(df['Symbol']) ]
        return df

    
    def download_stock_financials(self,stock):
        '''
        Takes a stock ticker as input and fetches related financial data given the start and end time
        
        Inputs
        stock-- The stock ticker
        
        The function generates and stores a csv file for stock financial data under the given path
        '''
        try:
            stock_data = web.DataReader(stock[0],'yahoo', self.start_time, self.end_time)
            stock_data[['Symbol','GICS Sector','GICS Sub-Industry']] = pd.DataFrame([stock], index=stock_data.index)
            output_name = stock[0] + '_data.csv'
            stock_data.to_csv(self.op_path +str(output_name))

        except:
            print('bad: %s' % (stock[0]))

    def MultiThreader(self, op_path):
        '''
        We utilize the ThreadPoolExecutor from concurrent.futures module
        to speed up the downloads utilizing parallelism    
        
        Input
        op_path-- The path for output csv to be stored
        '''
        self.op_path = op_path
        max_workers = 50
        sp_stocks = self.download_s_and_p_stats()
        
        #To handle case when smaller number of stock than thread are passed
        workers = min(max_workers, len(sp_stocks))
        
        with futures.ThreadPoolExecutor(workers) as executor:
            res = executor.map(self.download_stock_financials, sp_stocks.values)
    
    def combine_stock_data(self,curr_path, files_path, output_path = None, save_csv=False):
        '''
        Function to combine csv files for all stocks into one
        
        Inputs:
        curr_path-- Path for Current working directory 
        files_path-- Path where the files to be merged are stored
        output_path-- Path to store the merged CSV file (default=None)(optional)
        save_csv-- Boolean to specificy is CSV file should be stored (default=False)(optional)
        '''
        #changing the working directory to files path
        os.chdir(files_path)

        #fetching all file names with an '.csv' extension
        file_extension = '.csv'
        all_filenames = [i for i in glob.glob(f"*{file_extension}")]

        #combining the files
        combined_stocks_data = pd.concat([pd.read_csv(f ) for f in all_filenames])

        if save_csv == True:
            #changing working directory to the given output path
            os.chdir(output_path)
            #saving our combined csv data
            combined_stocks_data.to_csv('combined_stocks_data.csv')
        
        #chaning work directory back to original
        os.chdir(curr_path)
        return combined_stocks_data
In [ ]:
if __name__ == '__main__':
    #defining working directory paths
    curr_path = os.getcwd()
    files_path = curr_path + '/Data/Individual Stocks/'
    output_path = curr_path + '/Data/'

    #declaring gather_data class object
    gd_obj = gather_data()
    #downloading stock data
    gd_obj.MultiThreader(files_path)
    
    #combining the data files
    combined_data = gd_obj.combine_stock_data(curr_path,files_path, output_path, True)
In [ ]:
#displaying 
combined_data.head()
In [ ]:
#Shape of the combined data
np.shape(combined_data)

Data Preprocessing

The data preprocessing part contains of 2 steps:

  1. Feature Engineering: We apply feature engineering on the data to extract multiple technical indicators which might help us to understand and realize the link between two stocks. For this we make use of generate_technical_indicator() from the feature_engineering class to derive and add aditional features to a given the stock financials data.

  2. Data Cleaning: The NA values are handles and the data is standardised to prepare it for future analysis

Technical Indicators

  • RSI - Relative Strength Index, a momentum indicator used by traders to analyze bullish behaviou when a stock is oversold or bearish when a stock is overbought
$$\text{RSI} = 100 - \frac{100}{1 + RS}$$

$$\text{RS} = -\frac{AvgGain_{prev} * 13 + Current Gain}{AvgLoss_{prev} * 13 + Current Loss}$$

  • KAMA - Kufman's Adaptive Moving Average, accounts for noise and volatile nature of the market. It closely follows price when noise is low and smooths out the noise when price fluctuates. It can be used to identify the turning points and price movements
  • ADI - Accumulation/Distribution index acts as a leading indicator of price movements relative to the volume
$$\text{CLV} = \frac{(close - low) - (high - close)}{high-low}$$

$$\text{accdist} = accdist_{prev} + volume * CLV $$

  • VPT - Volume Price trend is an indicator based on a running cumulative volume that adds or substracts a multiple of the percentage change in share price trend and current volume, depending upon the investment’s upward or downward movements.
$$\text{VPT} = VPT_{prev} + volume * \frac{(close_{today} - close_{prev})}{close_{prev}}$$

  • ATR - Average True Range, is an indicator that provides the intution of the degree of price volatility. $$\text{ATR} = \frac{1}{n}\sum_{i=1}^{n}TR_{i}$$ $$\text{TR} = max[(high-low), abs(high - close_{prev}),abs(close_{prev} - low)]$$
  • Bollinger M-Band - Simple moving average of the price movements
$$\text{BB} = n - MA_{daily} $$
  • EMA - Exponential Moving Average of the price movements
$$\text{EMA} = [Close - EMA_{prev}] * \frac{2}{n+1} + EMA_{prev} $$
  • Average Directional Movement Index - It is an indicator of strength of the trend for a given stock. . It is compose of two indicators, positive direction indicator (+DI) and negative direction indicator(-DI)

  • Moving Average Convergence Divergence - It is a trend-following momentum indicator that shows the relationship between two moving averages of prices in a given time window

  • Daily returns - The percentage gain from previous day's Adjacent closing price

  • Log returns - The percentage gain from logarithmic previous day's closing price

$$\text{Log Return} = ln \frac{Close_i}{Close_{i-1}} $$

Reference: Wikipedia

In [ ]:
class feature_engineering(object):
    def __init__(self):
        super(feature_engineering, self).__init__()

    def generate_technical_indicator(self,data_df): 
        '''
        Function to generate additional technical indicators for the stock
        
        Input:
        data_df-- Dataframe containing stock finacials data
        
        Output:
        Stock finacials data with added Dataframe of feature obtained from feature engineering
        ''' 
        # 1. Momentum Indicators
        # Relative Strength Index
        df = data_df
        df['rsi'] = ta.momentum.rsi(df['Close'], window=14)
        #Kaufman’s Adaptive Moving Average (KAMA)
        df['kama'] = ta.momentum.kama(df['Close'],window=14)

        # 2. Volume Indicators
        # Accumulation/Distribution Index (ADI)
        df['adi'] = ta.volume.acc_dist_index(df['High'], df['Low'], df['Close'], df['Volume'])

        # Volume-price trend (VPT)
        df['vpt'] = ta.volume.volume_price_trend(df['Close'], df['Volume'])

        # 3. Volatility Indicators
        # Average True Range (ATR)
        df['atr'] = ta.volatility.average_true_range(df['High'], df['Low'],df['Close'], window=14)

        # Bollinger Bands (BB) N-period simple moving average (MA)
        df['bb_ma'] = ta.volatility.bollinger_mavg(df['Close'], window=20)

        # 4. Trend Indicators
        # Average Directional Movement Index (ADX)
        df['adx'] = ta.trend.adx(df['High'], df['Low'], df['Close'], window=14)

        # Exponential Moving Average
        df['ema'] = ta.trend.ema_indicator(df['Close'], window=14)

        # Moving Average Convergence Divergence (MACD)
        df['macd'] = ta.trend.macd(df['Close'], window_fast=14, window_slow=30)

        # 5. Other Indicators
        # Daily Log Return (DLR)
        df['dlr'] = ta.others.daily_log_return(df['Close'])

        # Daily Returns
        df['daily_returns'] = df['Adj Close'].pct_change()

        # Moving Averages
        averages = [50,200]
        for avg in averages:
            col_name = str(avg) +' Days Average'
            df[col_name] = df['Adj Close'].rolling(window = avg, center = False).mean()

        return df
In [ ]:
#declaring feature_engineering class object
fe_obj = feature_engineering()

#Adding technical indicators
final_data = fe_obj.generate_technical_indicator(combined_data)
final_data.head()
In [ ]:
#Null value check
final_data.isna().sum()
Out[ ]:
Date                   0
High                   0
Low                    0
Open                   0
Close                  0
Volume                 0
Adj Close              0
Symbol                 0
GICS Sector            0
GICS Sub-Industry      0
rsi                   13
kama                  13
adi                    0
vpt                    0
atr                    0
bb_ma                 19
adx                    0
ema                   13
macd                  29
dlr                    1
daily_returns          1
50 Days Average       49
200 Days Average     199
dtype: int64
In [ ]:
#Cleaning out the NA values
final_data = final_data.dropna()
np.shape(final_data)
Out[ ]:
(628279, 23)
In [ ]:
#Null value check
final_data.isna().sum()
Out[ ]:
Date                 0
High                 0
Low                  0
Open                 0
Close                0
Volume               0
Adj Close            0
Symbol               0
GICS Sector          0
GICS Sub-Industry    0
rsi                  0
kama                 0
adi                  0
vpt                  0
atr                  0
bb_ma                0
adx                  0
ema                  0
macd                 0
dlr                  0
daily_returns        0
50 Days Average      0
200 Days Average     0
dtype: int64
In [ ]:
final_data.columns
Out[ ]:
Index(['Date', 'High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close', 'Symbol',
       'GICS Sector', 'GICS Sub-Industry', 'rsi', 'kama', 'adi', 'vpt', 'atr',
       'bb_ma', 'adx', 'ema', 'macd', 'dlr', 'daily_returns',
       '50 Days Average', '200 Days Average'],
      dtype='object')
In [ ]:
#Applying Standard Scaling on the Contiunous features
train_data = final_data[['High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close','rsi', 'kama', 'adi', 'vpt', 'atr',
       'bb_ma', 'adx', 'ema', 'macd', 'dlr', 'daily_returns',
       '50 Days Average', '200 Days Average']]
standard_df = StandardScaler().fit_transform(train_data)
standard_df.shape
Out[ ]:
(628279, 19)
In [ ]:
#Null value check
train_data.isna().sum()
Out[ ]:
High                0
Low                 0
Open                0
Close               0
Volume              0
Adj Close           0
rsi                 0
kama                0
adi                 0
vpt                 0
atr                 0
bb_ma               0
adx                 0
ema                 0
macd                0
dlr                 0
daily_returns       0
50 Days Average     0
200 Days Average    0
dtype: int64

Pair Selection

Data Preparation for Pair Selection

Below is the data created after initial data scraping and feature engineering. It consists of 5 year daily data for all the S&P 500 tickers

In [2]:
final_data = pd.read_csv('/Users/prakharpatidar/Google Drive/DS/InfoViz DS 5500/Data/Work_data/final_data.csv', index_col = 0)
final_data = final_data.dropna().reset_index(drop=True)
final_data.head()
Out[2]:
Date High Low Open Close Volume Adj Close Symbol GICS Sector GICS Sub-Industry ... vpt atr bb_ma adx ema macd dlr daily_returns 50 Days Average 200 Days Average
0 2017-03-31 230.800003 228.729996 230.529999 229.720001 2858400.0 213.102936 GS Financials Investment Banking & Brokerage ... 19733.485599 4.362371 239.862500 28.132003 234.506712 -4.270697 -0.650846 -0.006487 223.794005 183.652283
1 2017-04-03 230.100006 225.570007 230.000000 228.960007 3735600.0 212.397903 GS Financials Investment Banking & Brokerage ... -30902.056304 4.374344 238.710001 28.838974 233.767151 -4.376877 -0.331384 -0.003308 223.745124 184.040659
2 2017-04-04 230.690002 227.279999 227.720001 229.259995 3042700.0 212.676163 GS Financials Investment Banking & Brokerage ... -8372.073893 4.305463 237.628001 29.309057 233.166197 -4.404667 0.130936 0.001310 223.693110 184.434975
3 2017-04-05 232.889999 227.309998 232.149994 227.660004 5286900.0 211.191956 GS Financials Investment Banking & Brokerage ... -32910.333192 4.396501 236.499001 29.063049 232.432038 -4.499416 -0.700340 -0.006979 223.592723 184.812176
4 2017-04-06 230.119995 225.710007 227.089996 228.639999 2927200.0 212.101059 GS Financials Investment Banking & Brokerage ... -24296.377285 4.397465 235.422001 29.111926 231.926433 -4.470089 0.429541 0.004305 223.444456 185.191166

5 rows × 23 columns

In [3]:
final_data.columns
Out[3]:
Index(['Date', 'High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close', 'Symbol',
       'GICS Sector', 'GICS Sub-Industry', 'rsi', 'kama', 'adi', 'vpt', 'atr',
       'bb_ma', 'adx', 'ema', 'macd', 'dlr', 'daily_returns',
       '50 Days Average', '200 Days Average'],
      dtype='object')
In [4]:
df = final_data.pivot(index='Date', columns='Symbol', values='Adj Close')
df.head()
Out[4]:
Symbol A AAL AAP AAPL ABBV ABC ABMD ABT ACN ADBE ... XEL XLNX XOM XRAY XYL YUM ZBH ZBRA ZION ZTS
Date
2016-06-16 43.403851 28.004217 151.626587 22.782436 48.650112 69.610771 100.720001 34.157757 108.860588 97.190002 ... 36.951801 43.106903 71.028473 61.166782 42.644634 54.472233 112.176666 56.619999 23.635189 45.757252
2016-06-17 43.394272 28.196426 152.227386 22.263964 47.939137 69.335968 99.230003 34.057552 108.188843 95.580002 ... 36.969048 42.701965 70.639130 60.887432 42.795029 54.886093 111.955605 56.930000 23.814035 44.963329
2016-06-20 44.140491 28.552002 153.241913 22.210251 48.019020 69.949654 101.720001 34.385471 109.550766 97.989998 ... 36.882778 42.794003 70.950615 60.993382 43.330788 55.214558 113.080154 57.560001 23.849810 45.418388
2016-06-21 44.159615 28.849916 152.119064 22.399418 47.915176 70.132835 101.760002 34.367256 109.900421 99.720001 ... 36.960419 42.959656 71.269867 61.070457 43.227394 55.372219 113.983635 57.200001 23.983950 45.786297
2016-06-22 44.035240 28.724991 150.858307 22.315344 48.242691 70.645752 101.699997 35.287235 109.532349 94.010002 ... 36.813755 42.720375 70.989532 61.213192 43.152203 55.109451 114.253296 57.599998 24.118084 45.825024

5 rows × 505 columns

In [5]:
exp_data = df.copy()
In [6]:
pd.set_option('precision', 3)
exp_data.describe().T.head()
Out[6]:
count mean std min 25% 50% 75% max
Symbol
A 1259.0 74.310 23.633 40.429 60.947 68.341 82.590 144.620
AAL 1259.0 32.697 12.411 9.040 23.995 33.432 43.295 56.989
AAP 1259.0 144.489 26.087 74.564 131.176 151.545 161.235 208.590
AAPL 1259.0 59.951 33.037 21.496 37.304 46.380 74.037 142.704
ABBV 1259.0 76.187 17.533 45.366 60.690 78.011 86.101 117.210
In [7]:
exp_data.isnull().values.any()
Out[7]:
True
In [8]:
missingno.matrix(exp_data)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd16d4cec10>
In [9]:
print('Data Shape before cleaning =', exp_data.shape)

missing_percentage = exp_data.isnull().mean().sort_values(ascending=False)
missing_percentage.head(10)
dropped_list = sorted(list(missing_percentage[missing_percentage > 0.2].index))
exp_data.drop(labels=dropped_list, axis=1, inplace=True)

print('Data Shape after cleaning =', exp_data.shape)
Data Shape before cleaning = (1259, 505)
Data Shape after cleaning = (1259, 497)
In [10]:
exp_data = exp_data.fillna(exp_data.median())
exp_data
Out[10]:
Symbol A AAL AAP AAPL ABBV ABC ABMD ABT ACN ADBE ... XEL XLNX XOM XRAY XYL YUM ZBH ZBRA ZION ZTS
Date
2016-06-16 43.404 28.004 151.627 22.782 48.650 69.611 100.72 34.158 108.861 97.19 ... 36.952 43.107 71.028 61.167 42.645 54.472 112.177 56.62 23.635 45.757
2016-06-17 43.394 28.196 152.227 22.264 47.939 69.336 99.23 34.058 108.189 95.58 ... 36.969 42.702 70.639 60.887 42.795 54.886 111.956 56.93 23.814 44.963
2016-06-20 44.140 28.552 153.242 22.210 48.019 69.950 101.72 34.385 109.551 97.99 ... 36.883 42.794 70.951 60.993 43.331 55.215 113.080 57.56 23.850 45.418
2016-06-21 44.160 28.850 152.119 22.399 47.915 70.133 101.76 34.367 109.900 99.72 ... 36.960 42.960 71.270 61.070 43.227 55.372 113.984 57.20 23.984 45.786
2016-06-22 44.035 28.725 150.858 22.315 48.243 70.646 101.70 35.287 109.532 94.01 ... 36.814 42.720 70.990 61.213 43.152 55.109 114.253 57.60 24.118 45.825
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-06-10 143.060 23.460 195.850 126.110 116.240 120.940 298.13 110.150 284.220 535.52 ... 69.171 128.790 62.750 66.070 118.230 118.710 159.260 508.63 55.080 182.100
2021-06-11 143.530 23.530 199.440 127.350 115.420 119.840 299.19 109.910 285.280 541.26 ... 69.082 128.500 62.170 65.570 118.160 119.650 158.280 508.48 55.270 182.200
2021-06-14 144.390 22.990 198.590 130.480 115.400 118.730 302.87 110.480 285.710 556.95 ... 69.080 128.330 62.070 64.670 117.270 118.060 157.550 509.92 54.020 184.580
2021-06-15 144.620 22.790 200.660 129.640 115.830 119.030 301.72 110.410 286.190 548.46 ... 69.190 127.040 64.330 65.890 118.090 118.570 158.340 506.10 54.520 185.670
2021-06-16 143.590 22.830 196.920 130.150 115.530 118.360 303.30 110.060 283.940 543.33 ... 67.760 126.180 64.100 65.520 115.490 118.140 158.030 506.70 55.130 184.110

1259 rows × 497 columns

In [11]:
missingno.matrix(exp_data)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd16bb15bb0>
In [12]:
returns = exp_data.pct_change().mean()*266
returns = pd.DataFrame(returns)
returns.columns = ['returns']

returns['volatility'] = exp_data.pct_change().std()*np.sqrt(266)

ret_data = returns
ret_data.head()
Out[12]:
returns volatility
Symbol
A 0.290 0.270
AAL 0.126 0.590
AAP 0.122 0.364
AAPL 0.417 0.311
ABBV 0.226 0.291
In [13]:
#Prepare the scaler
scale = StandardScaler().fit(ret_data)

#Fit the scaler
scaled_data = pd.DataFrame(scale.fit_transform(ret_data),columns = ret_data.columns, index = ret_data.index)
X = scaled_data
X.head()
Out[13]:
returns volatility
Symbol
A 0.516 -0.783
AAL -0.721 2.710
AAP -0.747 0.247
AAPL 1.478 -0.332
ABBV 0.032 -0.555

Cluster Analysis

Cluster Analysis is a technique used to group sets of objects that share similar characteristics. It has been prominently used in the finance domain by investors to develop a cluster trading approach to help them build a diversified portfolio, helping protect the investor’s portfolio against systemic risks that could make the portfolio vulnerable to losses.

Here we explore three different clustering techniques for extracting different clusters of stocks exhibiting minimal correlation from one another. The three techniques are compared based on Silhouette Scores and the clusters from the technique with highest score is selected for further analysis. The pairs in each cluster are tested for cointegration via the ADF test giving us optimal pairs for our pair trading strategy.

K-Means Clustering

K-Means algorithm is an iterative algorithms that partitions the datasets into k distinct non-overlapping clusters where each datapoint belongs exactly to one cluster. It tries to cluster most similar (homogenous) points together while also maximizing distance between each individual cluster. The elbow method is used to iterate through the values of k and calculate the distortion for each value of k, and distortion and inertia for each value of k in the specified range, where distortion is the average of

the squared distances of each feature to the closest cluster center. We use Silhouette method to reaffirm the optimal number of clusters (K). The K-Means clustering model is trained using the found optimal K value to generate clusters for our pair selection.

In [14]:
K = range(1,15)
distortions = []

#Fit the method
for k in K:
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

#Plot the results
fig = plt.figure(figsize= (15,5))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('Elbow Method')
plt.grid(True)
plt.show()
In [15]:
kl = KneeLocator(K, distortions, curve="convex", direction="decreasing")
kl.elbow
Out[15]:
4
In [16]:
#For the silhouette method k needs to start from 2
K = range(2,15)
silhouettes = []

#Fit the method
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10, init='random')
    kmeans.fit(X)
    silhouettes.append(silhouette_score(X, kmeans.labels_))

#Plot the results
fig = plt.figure(figsize= (15,5))
plt.plot(K, silhouettes, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette Method')
plt.grid(True)
plt.show()

kl = KneeLocator(K, silhouettes, curve="convex", direction="decreasing")
print('Suggested number of clusters: ', kl.elbow)
Suggested number of clusters:  8
In [17]:
c = 8
#Fit the model
k_means = KMeans(n_clusters=c)
k_means.fit(X)
prediction = k_means.predict(X)

#Plot the results
centroids = k_means.cluster_centers_
fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c=k_means.labels_, cmap="rainbow", label = X.index)
ax.set_title('k-Means Cluster Analysis Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=10)
plt.show()
In [18]:
clustered_series = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series_all = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series = clustered_series[clustered_series != -1]
plt.figure(figsize=(12,8))
plt.barh(range(len(clustered_series.value_counts())),clustered_series.value_counts())
plt.title('Clusters')
plt.xlabel('Stocks per Cluster')
plt.ylabel('Cluster Number')
plt.show()
In [19]:
counts = clustered_series.value_counts()

cluster_vis_list = list(counts[(counts<20) & (counts>1)].index)[::-1]

clust1 = pd.DataFrame(clustered_series,columns=['label'])
grouped_df = clust1.groupby('label')
In [20]:
pivot_data = final_data.pivot(index='Symbol', columns='Date', values='Adj Close')
pivot_data.head()
Out[20]:
Date 2016-06-16 2016-06-17 2016-06-20 2016-06-21 2016-06-22 2016-06-23 2016-06-24 2016-06-27 2016-06-28 2016-06-29 ... 2021-06-03 2021-06-04 2021-06-07 2021-06-08 2021-06-09 2021-06-10 2021-06-11 2021-06-14 2021-06-15 2021-06-16
Symbol
A 43.404 43.394 44.140 44.160 44.035 44.657 42.208 40.429 41.242 41.864 ... 136.37 137.90 137.65 138.75 140.13 143.06 143.53 144.39 144.62 143.59
AAL 28.004 28.196 28.552 28.850 28.725 29.148 25.996 24.285 25.727 26.620 ... 24.93 24.30 24.25 24.22 23.85 23.46 23.53 22.99 22.79 22.83
AAP 151.627 152.227 153.242 152.119 150.858 152.602 152.198 151.469 154.621 158.590 ... 191.14 192.76 193.01 197.87 197.03 195.85 199.44 198.59 200.66 196.92
AAPL 22.782 22.264 22.210 22.399 22.315 22.444 21.813 21.496 21.858 22.047 ... 123.54 125.89 125.90 126.74 127.13 126.11 127.35 130.48 129.64 130.15
ABBV 48.650 47.939 48.019 47.915 48.243 48.978 47.819 46.877 47.923 49.377 ... 112.21 112.36 113.01 112.34 114.00 116.24 115.42 115.40 115.83 115.53

5 rows × 1259 columns

In [21]:
#visualizing time series for clusters in K-means
for key, item in grouped_df:
    print('Group: ',key)
    grp_df = grouped_df.get_group(key)
    Syms = grp_df.index
    means = np.log(pivot_data.loc[Syms,].T.mean())
    temp =  np.log(pivot_data.loc[Syms,].T.sub(means))
    # temp.plot()
    # plt.show()
    temp = temp.reset_index()
    col_lst = temp.columns.values.tolist()
    col_lst = col_lst.remove('Date')
    df_melt = temp.melt(id_vars='Date', value_vars=col_lst)
    fig = px.line(df_melt, x='Date', y='value',color='Symbol')
    fig.show()
Group:  0
Group:  1
Group:  2
Group:  3
Group:  4
Group:  5
Group:  6
Group:  7

Heirarchiical Clustering

Hierarchical Clustering is an unsupervised algorithm that groups similar objects into distinct clusters. The method performs the clustering by creating a tree of clusters by grouping and separating features on each iteration. The product of the clustering process is visualized in a figure known as “dendrogram”. The groupage can be performed by an agglomerative (bottom-up) or divisive (top-down) approach. The clustering technique can be applied using different similarity measures such as Ward, Average, Complete or Single Linkage. The main benefit of this algorithms is that, unlike K-Means, it doesn’t require us to specify the number of clusters in advance.

Since we want to minimize the variance distance between our clusters, for our purpose, we apply Hierarchical Clustering using Ward linkage. A distance threshold of 13.5, selected using hyper parameter tuning, is used for extracting the number of clusters for training the Hierarchical (Agglomerative) Clustering model.

In [22]:
#plotting dendrogram
plt.figure(figsize=(15, 10))  
plt.title("Dendrograms")  
dend = shc.dendrogram(shc.linkage(X, method='ward'))
In [23]:
#plotting cutoff threshold for dendrogram
plt.figure(figsize=(15, 10))  
plt.title("Dendrogram")  
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.axhline(y=13.5, color='purple', linestyle='--')
Out[23]:
<matplotlib.lines.Line2D at 0x7fd1710f7f70>
In [24]:
#Fit the model
clusters = 4
hc = AgglomerativeClustering(n_clusters= clusters, affinity='euclidean', linkage='ward')
labels = hc.fit_predict(X)

#Plot the results
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0], X.iloc[:,1], c=labels, cmap='rainbow')
ax.set_title('Hierarchical Clustering Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.show()

Affinity Proprogation Clustering

Affinity Propagation Clustering is a method that creates clusters by a criterion of how well suited an instance is to be a representative of another. This can be imagined as instances messaging each other on how much they suit one another. After that, an instance that received messages from multiple senders will send back the revised value of attractiveness to each sender. This messaging will proceed until an agreement is reached. When a sender gets associated with the receiver the receiver will become the exemplar. All data points with the same exemplar will then create a cluster. The advantage of using Affinity Propagation Clustering is that it determines the clusters by itself in fully unsupervised manner.

In [25]:
#Fit the model
ap = AffinityPropagation()
ap.fit(X)
labels1 = ap.predict(X)

#Plot the results
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0], X.iloc[:,1], c=labels1, cmap='rainbow')
ax.set_title('Affinity Propagation Clustering Results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.show()
In [26]:
#Extract the cluster centers and labels
cci = ap.cluster_centers_indices_
labels2 = ap.labels_

#Print their number
clusters = len(cci)
print('The number of clusters is:',clusters)

#Plot the results
X_ap = np.asarray(X)
plt.close('all')
plt.figure(1)
plt.clf
fig=plt.figure(figsize=(15,10))
colors = cycle('cmykrgbcmykrgbcmykrgbcmykrgb')
for k, col in zip(range(clusters),colors):
    cluster_members = labels2 == k
    cluster_center = X_ap[cci[k]]
    plt.plot(X_ap[cluster_members, 0], X_ap[cluster_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=12)
    for x in X_ap[cluster_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.show()
The number of clusters is: 27
<Figure size 432x288 with 0 Axes>

Model Selection

In [27]:
print("k-Means Clustering", metrics.silhouette_score(X, k_means.labels_, metric='euclidean'))
print("Hierarchical Clustering", metrics.silhouette_score(X, hc.fit_predict(X), metric='euclidean'))
print("Affinity Propagation Clustering", metrics.silhouette_score(X, ap.labels_, metric='euclidean'))
k-Means Clustering 0.34353301803878894
Hierarchical Clustering 0.29202574336792847
Affinity Propagation Clustering 0.3248563629988216

Silhouette Score was used as an evaluation metric for the clustering techniques. K-means, having the highest Silhouette score out of the three, was selected as the most appropriate approach.

In [28]:
cluster_size_limit = 1000
counts = clustered_series.value_counts()
ticker_count = counts[(counts>1) & (counts<=cluster_size_limit)]
print ("Number of clusters: %d" % len(ticker_count))
print ("Number of Pairs: %d" % (ticker_count*(ticker_count-1)).sum())
Number of clusters: 8
Number of Pairs: 42638

Cointegrated Augmented Dickey Fuller Test

A Cointegration test is used to explore if there is a correlation between several time series in the long term. The test was introduced as an alternative to linear regression for analyzing time series to solve the issue of spurious correlation. The tests identify scenarios where two or more non-stationary time series are integrated together in a way that they cannot deviate in the long term. The tests are used to identify the degree of sensitivity of two variables to the same average price over a specified period of time.

We apply Cointegrated Augmented Dickey Fuller test for testing the stationarity of stocks followed by Cointegration testing on stocks within a cluster to find cointegrated pairs which could be used for training our spread prediction models.

In [ ]:
#defining function for cointegrated augmented dickey fuller test
def find_cadf_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = {}
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            ols = sm.OLS(S1,S2).fit()
            pred = ols.predict(S2)
            adf = ts.adfuller(S1-pred)
            score = adf[0]
            pvalue = adf[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs[(keys[i], keys[j])] = adf
    return score_matrix, pvalue_matrix, pairs
In [ ]:
#extracting cointegrated pairs
adf_dict = {}

for i, clust in enumerate(ticker_count.index):
    tickers = clustered_series[clustered_series == clust].index
    score_matrix, pvalue_matrix, adf_pairs = find_cadf_pairs(data1[tickers])
    adf_dict[clust] = {}
    adf_dict[clust]['score_matrix'] = score_matrix
    adf_dict[clust]['pvalue_matrix'] = pvalue_matrix
    adf_dict[clust]['pairs'] = adf_pairs
    
adf_pairs = []   
for cluster in adf_dict.keys():
    adf_pairs.extend(adf_dict[cluster]['pairs'])
    
print ("Number of pairs:", len(adf_pairs))
print ("In those pairs, we found %d unique tickers." % len(np.unique(adf_pairs)))
print(adf_pairs)
In [29]:
#loading in previously saved adf pairs
with open('/Users/prakharpatidar/Google Drive/DS/InfoViz DS 5500/Data/Work_data/cadf_pairs.pkl', 'rb') as f:
    cadf_dict = pickle.load(f)

adf_pairs = []   
for cluster in cadf_dict.keys():
    adf_pairs.extend(cadf_dict[cluster]['pairs'])

print ("Number of pairs:", len(adf_pairs))
print ("In those pairs, we found %d unique tickers." % len(np.unique(adf_pairs)))
Number of pairs: 1846
In those pairs, we found 454 unique tickers.

Visualizing Pairs

TSNE
In [30]:
#TSNE visualization
def plot_tsne(pairs):
    #generating data for tsne
    stocks = np.unique(pairs)
    X_data = pd.DataFrame(index=X.index, data=X).T
    in_pairs_series = clustered_series.loc[stocks]
    stocks = list(np.unique(pairs))
    X_pairs = X_data.T.loc[stocks]
    
    #applying tsne
    X_tsne = TSNE(learning_rate=30, perplexity=5, random_state=42, n_jobs=-1).fit_transform(X_pairs)
    
    #visualizing tsne
    plt.figure(1, facecolor='white',figsize=(18,15))
    plt.clf()
    plt.axis('off')
    for pair in pairs:
        ticker1 = pair[0]
        loc1 = X_pairs.index.get_loc(pair[0])
        x1, y1 = X_tsne[loc1, :]
        ticker2 = pair[0]
        loc2 = X_pairs.index.get_loc(pair[1])
        x2, y2 = X_tsne[loc2, :]
        plt.plot([x1, x2], [y1, y2], 'k-', alpha=0.3, c='b');

    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=215, alpha=0.8, c=in_pairs_series.values, cmap=cm.Paired)
    plt.title('TSNE Visualization of Pairs'); 

    # Join pairs by x and y
    for x,y,name in zip(X_tsne[:,0],X_tsne[:,1],X_pairs.index):

        label = name

        plt.annotate(label,
                     (x,y),
                     textcoords="offset points",
                     xytext=(0,10),
                     ha='center')

    plt.show()
In [31]:
#plotting TSNE for pairs in just one cluster
plot_tsne(list(cadf_dict[5]['pairs'].keys()))
In [32]:
#plotting TSNE for pairs in all clusters
plot_tsne(adf_pairs)
In [33]:
final_data = pd.read_csv('/Users/prakharpatidar/Google Drive/DS/InfoViz DS 5500/Data/Work_data/final_data.csv', index_col = 0)
final_data = final_data.dropna().reset_index(drop=True)
final_data.head()
Out[33]:
Date High Low Open Close Volume Adj Close Symbol GICS Sector GICS Sub-Industry ... vpt atr bb_ma adx ema macd dlr daily_returns 50 Days Average 200 Days Average
0 2017-03-31 230.80 228.73 230.53 229.72 2.858e+06 213.103 GS Financials Investment Banking & Brokerage ... 19733.486 4.362 239.863 28.132 234.507 -4.271 -0.651 -0.006 223.794 183.652
1 2017-04-03 230.10 225.57 230.00 228.96 3.736e+06 212.398 GS Financials Investment Banking & Brokerage ... -30902.056 4.374 238.710 28.839 233.767 -4.377 -0.331 -0.003 223.745 184.041
2 2017-04-04 230.69 227.28 227.72 229.26 3.043e+06 212.676 GS Financials Investment Banking & Brokerage ... -8372.074 4.305 237.628 29.309 233.166 -4.405 0.131 0.001 223.693 184.435
3 2017-04-05 232.89 227.31 232.15 227.66 5.287e+06 211.192 GS Financials Investment Banking & Brokerage ... -32910.333 4.397 236.499 29.063 232.432 -4.499 -0.700 -0.007 223.593 184.812
4 2017-04-06 230.12 225.71 227.09 228.64 2.927e+06 212.101 GS Financials Investment Banking & Brokerage ... -24296.377 4.397 235.422 29.112 231.926 -4.470 0.430 0.004 223.444 185.191

5 rows × 23 columns

Heatmap Visualization
In [34]:
#Heatmap visualization of each cluster pairs
fig, axs = plt.subplots(4,2, figsize=(12,12))
fig.subplots_adjust(hspace = .5, wspace=.1)

clust = 0
for i in range(4):
    for j in range(2):
        sns.heatmap(cadf_dict[clust]['pvalue_matrix'],  cmap='RdYlGn_r' 
                    , mask = (cadf_dict[clust]['pvalue_matrix'] >= 0.98), ax=axs[i][j])
        axs[i][j].set_title(str('Cluster '+ str(clust)))
        clust +=1
fig.suptitle("P-value heatmap per cluster")
plt.show()
In [35]:
#Finding the lowest p-value pair for each cluster
pairs_per_cluster_dict = defaultdict(lambda: defaultdict(list))

for i in cadf_dict.keys():
    #finding best p_value matrix for the cluster
    best_pvalue = sorted(cadf_dict[i]['pvalue_matrix'], key = lambda kv: sorted(kv))[0]
    #finding index of the value in list
    idx_pvalue = next(idx for idx,item in enumerate(cadf_dict[i]['pvalue_matrix']) if (item == best_pvalue).all())
    if len(list(cadf_dict[i]['pairs'].keys())) !=0:
        pairs_per_cluster_dict[i]['stock_pair'] = list(cadf_dict[i]['pairs'].keys())[idx_pvalue]
        pairs_per_cluster_dict[i]['pvalue_matrix'] = best_pvalue
        pairs_per_cluster_dict[i]['score_matrix'] = cadf_dict[i]['score_matrix'][0][idx_pvalue]
In [36]:
#displaying best pairs for all clusters
[pairs_per_cluster_dict[clust]['stock_pair'] for clust in cadf_dict.keys()]
Out[36]:
[('ABBV', 'MNST'),
 ('AEP', 'BDX'),
 ('A', 'LOW'),
 ('COF', 'NUE'),
 ('ALK', 'NLSN'),
 ('GPS', 'MGM'),
 ('BBY', 'MU'),
 []]

Spread Forecasting

We pick one of the above pairs for further analysis.

Taking, Stock 1 (P1) = ABBV and Stock 2 (P2) = MNST

The create_timeseries() function creates a timeseries for a given pair of stocks P1 and P2, utilizing the features of each stock suffixed by P1 and P2. It merges the features of each stock by Date to create a wide format single dataframe

In [37]:
def create_timeseries(final_data, P1_name, P2_name):

    cols = final_data.columns.drop(['GICS Sector', 'GICS Sub-Industry', 'Symbol'])
    
    # subset the original data for the selected pair
    P1_data = final_data[final_data['Symbol']==P1_name][cols]
    P2_data = final_data[final_data['Symbol']==P2_name][cols]

    # rename columns to identify the features 
    P1_data.columns = ['P1_' + s for s in cols]
    P2_data.columns = ['P2_' + s for s in cols]

    # merge both the data on Date to create a single time series wide format data
    pair_data = pd.merge(P1_data, P2_data, left_on='P1_Date', right_on='P2_Date')
    pair_data.index = pair_data['P1_Date']
    del pair_data['P2_Date']
    del pair_data['P1_Date']
    return pair_data
In [38]:
P1 = 'ABBV'
P2 = 'MNST'
pair_data = create_timeseries(final_data, P1, P2)
pair_data.head()
Out[38]:
P1_High P1_Low P1_Open P1_Close P1_Volume P1_Adj Close P1_rsi P1_kama P1_adi P1_vpt ... P2_vpt P2_atr P2_bb_ma P2_adx P2_ema P2_macd P2_dlr P2_daily_returns P2_50 Days Average P2_200 Days Average
P1_Date
2016-06-16 60.94 59.91 59.93 60.90 7.011e+06 48.650 64.445 57.036 3.758e+10 751792.655 ... 1.260e+06 2.120 42.509 15.148 43.197 0.296 21.695 0.242 40.815 42.703
2016-06-17 60.97 59.57 60.88 60.01 1.257e+07 47.939 60.513 57.117 3.758e+10 550035.365 ... 1.226e+06 2.029 43.033 18.272 44.369 0.882 0.103 0.001 41.014 42.827
2016-06-20 61.18 60.06 60.64 60.11 7.076e+06 48.019 60.803 57.192 3.757e+10 -171976.379 ... 1.494e+04 1.915 43.528 21.212 45.394 1.354 0.122 0.001 41.222 42.949
2016-06-21 60.40 59.75 60.20 59.98 6.366e+06 47.915 60.185 57.254 3.757e+10 -1976.482 ... 4.646e+04 1.835 44.082 24.016 46.357 1.763 1.057 0.011 41.484 43.074
2016-06-22 61.18 59.92 60.00 60.39 6.719e+06 48.243 61.513 57.336 3.757e+10 32156.772 ... 2.143e+04 1.734 44.652 26.620 47.150 2.059 -0.572 -0.006 41.725 43.196

5 rows × 38 columns

In [39]:
pair_data['P1_Close'].plot(figsize=(15,7))
pair_data['P2_Close'].plot(figsize=(15,7))
plt.ylabel('Spread')
plt.xlabel('Date')
plt.title('Spread_Close')
plt.show()

Ground Truth Spread

The strategy uses the spread divergence to open/close a position. We open a position with the spread when the spread diverges from 0, hoping that it will converge again to its mean which is 0.

The ground_spread() function is used run linear regression of $ log(p_1) = \beta X log(p_2) + \varepsilon$, where $p_1$ and $p_2$ are daily closing prices of the stocks P1 and P2, the coefficient $\beta$ is the cointegration coeff and the stochastic term $\varepsilon$ is the spread.

In [40]:
def ground_spread(pair1, pair2):
    est = est = sm.OLS(pair1, pair2)
    est = est.fit()
    beta = -est.params[0]
    return pair1 + (pair2 * beta)
In [41]:
def acc_metric(true_value, predicted_value):
    acc_met = 0.0
    m = len(true_value)
    for i in range(m):
        acc_met += mean_squared_error(true_value, predicted_value)
    acc_met /= m
    return np.sqrt(acc_met)
In [42]:
#Calculating spread for close prices of both stocks
pair_data['Spread_Close'] = ground_spread(pair_data.P1_Close, pair_data.P2_Close)
pair_data['Spread_Close'].plot(figsize=(15,7))
plt.axhline(pair_data['Spread_Close'].mean())
plt.title('Spread for Ground Truth')
plt.ylabel('Spread')
plt.xlabel('Date')
plt.legend(['Spread_Close'])
plt.show()
In [43]:
#Calculating spread for open prices of both stocks
pair_data['Spread_Open'] = ground_spread(pair_data.P1_Open, pair_data.P2_Open)

#Calculating spread for high prices of both stocks
pair_data['Spread_High'] = ground_spread(pair_data.P1_High, pair_data.P2_High)

#Calculating spread for low prices of both stocks
pair_data['Spread_Low'] = ground_spread(pair_data.P1_Low, pair_data.P2_Low)

Preprocessing Data

In [44]:
#Creating train test split
train_size = int(len(pair_data) * 0.9)
val_size = int((len(pair_data) - train_size) * 0.5) - 30
test_size = len(pair_data) - train_size - val_size

#splitting the data
train, val, test = pair_data[0:train_size], pair_data[train_size:train_size + val_size], pair_data[train_size + val_size:len(pair_data)]
print(len(train), len(val), len(test))
1133 33 93
In [45]:
# Function for creating dataset with look back
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back):
        a = dataset[i, :]
        dataX.append(a)
        dataY.append(dataset[(i+1):(i+1+look_back), 0])
    print(len(dataY))
    return dataX, scaler.fit_transform(dataX), dataY, scaler.fit_transform(dataY)
In [46]:
# Creating data for look back period of S_t -1 and performing train_test split
look_back = 1
scaler = MinMaxScaler(feature_range=(0, 1))

# Generate dataset for trainX, trainY, testX, testY
X_train_unorm, X_train, Y_train_unorm, Y_train = create_dataset(train.values, look_back)
X_val_unorm, X_val, Y_val_unorm, Y_val = create_dataset(val.values, look_back)
X_test_unorm, X_test, Y_test_unorm, Y_test = create_dataset(test.values, look_back)
1132
32
92

ARIMA

We begin with testing ARIMA as our first model and making predictions 1-time step ahead.

ARIMA models are the most general class of models for forecasting on time series.

In [47]:
#Function to fit ARIMA model
def fit_ARIMA(data):
    yhat_ARIMA = []
    yhat_ARIMA_mse = []
    for i in range(train_size+val_size, len(data)-1):
        model = ARIMA(data[:i], order=(1,0,0))
        model_fit = model.fit(disp=0)
        temp = []
        forecast = (model_fit.forecast(steps=look_back)[0])
        yhat_ARIMA.append(forecast[0])
        for j in range(len(forecast)):
            temp.append(forecast[j])
        yhat_ARIMA_mse.extend(temp)
    return yhat_ARIMA, yhat_ARIMA_mse

#function to normalize a series
def normalize(series):
    return (series - np.mean(series)) / np.std(series)
In [48]:
#Arima Model on Spread close values
yhat_ARIMA, yhat_ARIMA_mse = fit_ARIMA(pair_data['Spread_Close'].values)

#calculating accuracy metric MSE
mse_ARIMA = acc_metric(Y_test_unorm, yhat_ARIMA_mse)
# print('ARIMA MSE: ', mse_ARIMA)
In [49]:
#plotting normalized Y_test and Prediction from ARIMA
plt.figure(figsize = (15,8))
plt.plot(pd.Series(normalize([i[0] for i in Y_test_unorm])), label='True Value')
plt.plot(pd.Series(normalize(yhat_ARIMA_mse)), label='Prediction Arima')
plt.title('Spread true vs predictied -- ARIMA')
plt.xlabel('timestep')
plt.ylabel('normalized spread')
plt.legend()
plt.show()

Kalman Filters

Moving on, we implemented Kalman Filters which also are very commonly used for predicting time series. The use of Kalman filters has been very well studied in literature in the field of prediction spread amongst cointegrated stocks.

In [50]:
def KalmanFilterAverage(x):
  # Construct a Kalman filter
    kf = KalmanFilter(transition_matrices = [1],
    observation_matrices = [1],
    initial_state_mean = 0,
    initial_state_covariance = 1,
    observation_covariance=1,
    transition_covariance=.01)
 
  # Use the observed values of the price to get a rolling mean
    state_means, _ = kf.filter(x.values)
    state_means = pd.Series(state_means.flatten(), index=x.index)
    return state_means
 
# Kalman filter regression
def KalmanFilterRegression(x,y):
    delta = 1e-3
    trans_cov = delta / (1 - delta) * np.eye(2) # How much random walk wiggles
    obs_mat = np.expand_dims(np.vstack([[x], [np.ones(len(x))]]).T, axis=1)
 
    kf = KalmanFilter(n_dim_obs=1, n_dim_state=2, # y is 1-dimensional, (alpha, beta) is 2-dimensional
    initial_state_mean=[0,0],
    initial_state_covariance=np.ones((2, 2)),
    transition_matrices=np.eye(2),
    observation_matrices=obs_mat,
    observation_covariance=2,
    transition_covariance=trans_cov)
 
    # Use the observations y to get running estimates and errors for the state parameters
    state_means, state_covs = kf.filter(y.values)
    return state_means

#function to forecast values
def forecast_kalman_lookback(Y_test_unorm, forecast, look_back=1):
    yhat_KF = forecast
    yhat_KF_mse = []
    mse = 0.0
    if look_back == 1:
        for i in range(len(forecast)):
            temp = []
            temp.append(forecast[i])
            yhat_KF_mse.append(np.array(temp))
        mse = acc_metric(normalize(Y_test_unorm), yhat_KF_mse)
    else:
        mse = 0.0
    
    return yhat_KF, yhat_KF_mse, mse
In [51]:
#fit Kalman filter regressor
state_means = - KalmanFilterRegression(KalmanFilterAverage(pair_data.P1_Close),KalmanFilterAverage(pair_data.P2_Close))[:,0]
#normalizing the results
results = normalize(pair_data.P1_Close + (pair_data.P2_Close * state_means))
#forecasting values using result
forecast = results[-len(X_test):].values

#finding MSE for prediction on lookback 1
yhat_Kalman, yhat_Kalman_mse, mse_kalman = forecast_kalman_lookback(Y_test_unorm, forecast, look_back=1)
# print('Kalman Filters MSE: ', mse_kalman)
In [52]:
#plotting normalized Y_test and Prediction from ARIMA
plt.figure(figsize = (15,8))
plt.plot(normalize(pair_data['Spread_Close']), label='True Value')
plt.plot(results, label='Prediction Arima')
plt.title('Spread true vs predictied -- Kalman Filters')
plt.xlabel('timestep')
plt.ylabel('normalized spread')
plt.axhline(normalize(pair_data['Spread_Close']).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Spread z-score', 'Kalman_Predicted_Spread', 'Mean', '+1', '-1'])
plt.show()

LSTM

After testing the data on the standard models, we move on to build our state-of-the-art Long short term memory neural network architecture.

LSTM models are capable of learning long term dependencies. The architecture consisted of 2 stacked LSTM neural layers followed by a fully connected layers and is used to predict the value of spread 1-time step in future.

In [53]:
#Creating the LSTM model
class lstm(Model):
    def __init__(self, in_shape, look_back):
        super(lstm, self).__init__()
        self.shape = in_shape
        self.look_back = look_back
        self.lstm_l1 = LSTM(256, input_shape=self.shape, return_sequences = True)
        self.lstm_l2 = LSTM(256)
        self.dropout = Dropout(0.2)
        self.dense = Dense(self.look_back)

    def call(self, x):
        x = self.lstm_l1(x)
        x = self.lstm_l2(x)
        x = self.dropout(x)
        x = self.dense(x)
        return x
In [54]:
# Reshape X for model training
trainX = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
valX = np.reshape(X_val, (X_val.shape[0], 1, X_val.shape[1]))
testX = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))
In [55]:
#initializing the lstm model
lstm_model = lstm((trainX.shape[1], trainX.shape[2]), look_back = 1)
lstm_model.compile(loss='mse', 
                   optimizer='adam', 
                   metrics=['accuracy'])
#fitting the model
history = lstm_model.fit(trainX, 
                    Y_train, 
                    epochs=400, 
                    batch_size=100,
                    validation_data=(valX, Y_val), 
                    verbose=0, 
                    shuffle=True)
In [56]:
lstm_model.summary()
Model: "lstm"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                multiple                  306176    
_________________________________________________________________
lstm_2 (LSTM)                multiple                  525312    
_________________________________________________________________
dropout (Dropout)            multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  257       
=================================================================
Total params: 831,745
Trainable params: 831,745
Non-trainable params: 0
_________________________________________________________________
In [57]:
# Plot line graph to show amount loss according the the epoch
plt.figure(figsize=(8,6))
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.title('Loss Plot -- LSMT')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
In [58]:
# Predictions on X_test 
yhat = lstm_model.predict(testX)
plt.figure(figsize=(15,8))
plt.plot(Y_test, label='True Value')
plt.plot(yhat, label='Predicted Value -- LSTM')
plt.title('Spread true vs predictied -- LSTM')
plt.xlabel('timestep')
plt.ylabel('normalized spread')
plt.legend()
plt.show()
In [59]:
# Scaler Inverse Y back to normal value
yhat_LSTM = scaler.inverse_transform(yhat)
Y_test_LSTM = scaler.inverse_transform(Y_test)

# Training and Test error
mse_LSTM_train = acc_metric(scaler.inverse_transform(Y_train), scaler.inverse_transform(lstm_model.predict(trainX)))
mse_LSTM_test = acc_metric(yhat_LSTM, Y_test_LSTM)
# print('LSTM MSE: ')
# print('Train-- ', mse_LSTM_train)
# print('Test-- ', mse_LSTM_test)
In [60]:
predictDates = pair_data.tail(len(testX)).index
testY_reshape = normalize(Y_test_LSTM).reshape(len(Y_test_LSTM))
yhat_reshape = normalize(yhat_LSTM).reshape(len(yhat_LSTM))
kalman_reshape = normalize(yhat_Kalman).reshape(len(yhat_Kalman))
#Plot predicted and actual line graph
layout = go.Layout(
        yaxis=dict(
        title='Spread',
        titlefont=dict(
            family='Arial, sans-serif',
            size=18
        )))
actual_chart = go.Scatter(x=predictDates, y=testY_reshape, name= 'Actual Spread')
predict_chart = go.Scatter(x=predictDates, y=yhat_reshape, name= 'LSTM Predict Spread')
predict_kalman_chart = go.Scatter(x=predictDates, y=kalman_reshape, name= 'Kalman Filter Predict Spread')
fig = go.Figure(data = [predict_kalman_chart, predict_chart, actual_chart], layout = layout)
fig.update_layout(
    title='Actual vs Predicted Spread')
py.iplot(fig)

Developing trading strategy

The spread information is used to develop a simple trading strategy to determine when to buy or sell the stocks.

We look at the value of the normalized spread and when the spread significantly deviated from the mean of the spread,

we take a short position in the overvalues stock and a long position in the undervalued stock.

In [61]:
# Test data frame for testing trading strategy
test_data = pd.DataFrame({'P1':pair_data['P1_Close'].iloc[-len(testX):],'P2':pair_data['P2_Close'].iloc[-len(testX):]})
test_data['Actual_Spread'] = pair_data['Spread_Close'].iloc[-len(testX):]
test_data['Kalman_Predicted_Spread']  = yhat_Kalman
test_data['ARIMA_Predicted_Spread']  = yhat_ARIMA
test_data['LSTM_Predicted_Spread'] = list(yhat_LSTM[:,0])
data = test_data['Actual_Spread']
In [62]:
moving_avg_5day= data.rolling(window=5,center=False).mean()

moving_avg_60day = data.rolling(window=60,center=False).mean()

std_60day = data.rolling(window=60,
                        center=False).std()

zscore_60_5 = (moving_avg_5day - moving_avg_60day)/std_60day
plt.figure(figsize=(15,7))
plt.plot(data.index, data.values)
plt.plot(moving_avg_5day.index, moving_avg_5day.values)
plt.plot(moving_avg_60day.index, moving_avg_60day.values)

plt.legend(['Spread','5d Ratio Moving Avg', '60d Ratio Moving Avg'])

plt.ylabel('Spread')
plt.show()
In [63]:
plt.figure(figsize=(15,7))
zscore_60_5.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Rolling Spread z-Score', 'Mean', '+1', '-1'])
plt.show()
In [64]:
# Plot the ratios and buy and sell signals from z score
plt.figure(figsize=(15,7))
data[60:].plot()
#separating buy sell signals based on z_scores
buy = data.copy()
sell = data.copy()
buy[zscore_60_5>-1] = -100
sell[zscore_60_5<1] = -100
buy[60:].plot(color='g', linestyle='None', marker='^')
sell[60:].plot(color='r', linestyle='None', marker='^')
x1,x2,y1,y2 = plt.axis()
plt.axis((x1,x2,data.min(),data.max()))
plt.legend(['Spread', 'Buy Signal', 'Sell Signal'])
plt.show()
In [65]:
# Plot the prices and buy and sell signals from z score
plt.figure(figsize=(18,9))
P1 = test_data.P1
P2 = test_data.P2

P1[60:].plot(color='b')
P2[60:].plot(color='c')
buyR = 0*P1.copy()
sellR = 0*P1.copy()

# When buying the ratio, buy S1 and sell S2
buyR[buy!=-100] = P1[buy!=-100]
sellR[buy!=-100] = P2[buy!=-100]
# When selling the ratio, sell S1 and buy S2 
buyR[sell!=-100] = P2[sell!=-100]
sellR[sell!=-100] = P1[sell!=-100]

buyR[60:].plot(color='g', linestyle='None', marker='^')
sellR[60:].plot(color='r', linestyle='None', marker='^')
x1,x2,y1,y2 = plt.axis()
plt.axis((x1,x2,min(P1.min(),P2.min()),max(P1.max(),P2.max())))

plt.legend(['P1', 'P2', 'Buy Signal', 'Sell Signal'])
plt.show()
In [66]:
P1 = test_data.P1
P2 = test_data.P2


buyR = 0*P1.copy()
sellR = 0*P1.copy()

# When buying the ratio, buy S1 and sell S2
buyR[buy!=-100] = P1[buy!=-100]
sellR[buy!=-100] = P2[buy!=-100]
# When selling the ratio, sell S1 and buy S2 
buyR[sell!=-100] = P2[sell!=-100]
sellR[sell!=-100] = P1[sell!=-100]
fig = px.line(x=test_data[60:].index, y=[P1[60:].values, P2[60:].values], 
              title='Spread true vs predictied -- LSTM', 
              labels={'index':'Date', 'value':'Closing Price'})

fig.add_scatter(x=buyR[60:].index, y=buyR[60:].values, mode='markers', marker_color='green', 
                marker_symbol='triangle-up',marker_size=10)
fig.add_scatter(x=sellR[60:].index, y=sellR[60:].values, mode='markers', marker_color='red', 
                marker_symbol='triangle-up',marker_size=10)

# names = cycle(['P1', 'P2', 'Buy Signal', 'Sell Signal'])
fig.update_traces(line=dict(width=2.5))
# fig.update_layout(legend_title="")
fig.update_yaxes(range=[min(P1.min(),P2.min()),max(P1.max(),P2.max())])

fig.show()

Estimating profit using the developed trading strategy

In [67]:
def trade(Stock1, Stock2, spread, window1, window2):
    # If window length is 0, algorithm doesn't make sense, so exit
    if (window1 == 0) or (window2 == 0):
        return 0
    # Compute rolling mean and rolling standard deviation
    moving_avg1 = spread.rolling(window=window1, center=False).mean()
    moving_avg2 = spread.rolling(window=window2, center=False).mean()
    std = spread.rolling(window=window2, center=False).std()
    zscore = (moving_avg1 - moving_avg2)/std
    # Simulate trading
    # Start with no money and no positions
    money = 0
    trade_count_S1 = 0
    trade_count_S2 = 0
    for i in range(len(spread)):
        # Sell short if the z-score is > 1
        if zscore[i] > 1:
            money += Stock1[i] - Stock2[i] * spread[i]
            trade_count_S1 -= 1
            trade_count_S2 += spread[i]
        # Buy long if the z-score is < 1
        elif zscore[i] < -1:
            money -= Stock1[i] - Stock2[i] * spread[i]
            trade_count_S1 += 1
            trade_count_S2 -= spread[i]
        # Clear positions if the z-score between -.5 and .5
        elif abs(zscore[i]) < 0.5:
            money += (trade_count_S1 * Stock1[i]) - (Stock2[i] * trade_count_S2)
            trade_count_S1 = 0
            trade_count_S2 = 0
    return money
In [68]:
data = test_data['Actual_Spread'] 
profit = trade(test_data['P1'], test_data['P2'], data, 30, 5)
profit
Out[68]:
82840.63369032505